6 research outputs found
Astaroth: A software library for stencil computations on graphics processing units
Graphics processing units (GPUs) are coprocessors that offer higher throughput and better power efficiency than central processing units (CPUs) in data-parallel tasks. For this reason, GPUs provide a good platform for high-performance computing. However, programming GPUs so that all the available performance is utilized requires in-depth knowledge of the hardware architecture. Additionally, the problem of high-order stencil computations on GPUs in challenging multiphysics applications has not been adequately explored in previous work. In this thesis, we address these issues by presenting a library, an efficient algorithm, and a domain-specific language for solving stencil computations within a structured grid. We tested our implementation by simulating magnetohydrodynamics, which involved the computation of first, second, and cross partial derivatives using second-, fourth-, sixth-, and eighth-order finite differences with single and double precision. The running time of our integration kernel was 2.8–9.1 times slower than the theoretical minimum time it would take to read the computational domain and write it back to device memory exactly once, without taking into account the effects of finite caches or arithmetic operations on performance. Additionally, we made a performance comparison with a CPU solver widely used for scientific computations, which we benchmarked on a total of 24 cores of two Intel Xeon E5-2690 v3 processors. Our solver, benchmarked on a Tesla P100 PCIe GPU, outperformed the CPU solver by factors of 6.7 and 10.4 when using single and double precision, respectively.
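To make the high-order finite differences above concrete, here is a minimal sketch of a sixth-order central-difference first derivative on a uniform 1D grid. The stencil coefficients are the standard ones; the grid setup and function names are illustrative assumptions, not Astaroth's API.

```python
import math

# Sixth-order central-difference coefficients for the first derivative.
# The coefficient for the center point is zero and is omitted.
C = (-1.0 / 60.0, 3.0 / 20.0, -3.0 / 4.0, 3.0 / 4.0, -3.0 / 20.0, 1.0 / 60.0)
OFFSETS = (-3, -2, -1, 1, 2, 3)

def ddx_6th(f, i, dx):
    """Approximate the first derivative of the sampled function f at index i."""
    return sum(c * f[i + o] for c, o in zip(C, OFFSETS)) / dx

# Sample sin(x) on a uniform grid and differentiate at x = pi,
# where the exact derivative is cos(pi) = -1.
n = 64
dx = 2.0 * math.pi / n
f = [math.sin(k * dx) for k in range(n + 4)]  # extra points so i + 3 stays in range
i = n // 2
approx = ddx_6th(f, i, dx)
exact = math.cos(i * dx)
```

The truncation error scales as dx^6, so even this coarse grid agrees with the analytic derivative to better than one part in a million.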
Scalable communication for high-order stencil computations using CUDA-aware MPI
Modern compute nodes in high-performance computing provide a tremendous level
of parallelism and processing power. However, as arithmetic performance has
been observed to increase at a faster rate relative to memory and network
bandwidths, optimizing data movement has become critical for achieving strong
scaling in many communication-heavy applications. This performance gap has been
further accentuated with the introduction of graphics processing units, which
can provide several times higher throughput in data-parallel tasks than
central processing units. In this work, we explore the computational aspects of
iterative stencil loops and implement a generic communication scheme using
CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations
based on high-order finite differences and third-order Runge-Kutta integration.
We put particular focus on improving intra-node locality of workloads. In
comparison to a theoretical performance model, our implementation exhibits
strong scaling from one to devices at -- efficiency in
sixth-order stencil computations when the problem domain consists of
-- cells.
Comment: 17 pages, 15 figures
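The communication scheme above boils down to a halo (ghost-zone) exchange: each rank sends its boundary slabs to its neighbours and receives theirs into its ghost zones. The sketch below shows that bookkeeping for a periodic 1D decomposition, with ranks simulated by plain lists so it runs without MPI; it is a hypothetical illustration, not the paper's actual implementation. With CUDA-aware MPI, the same slabs would be device buffers handed directly to MPI_Isend/MPI_Irecv.

```python
R = 3  # ghost-zone width; a sixth-order stencil needs a radius of 3 cells

def exchange_halos(ranks):
    """Periodic 1D halo exchange over a list of per-rank arrays.

    Each entry has length R + n_local + R; indices [R:-R] are owned
    cells, the first and last R entries are ghost zones to be filled."""
    p = len(ranks)
    for r in range(p):
        left, right = ranks[(r - 1) % p], ranks[(r + 1) % p]
        # Left ghosts receive the left neighbour's rightmost owned cells,
        # right ghosts the right neighbour's leftmost owned cells.
        # Only ghost zones are written, so owned cells stay valid throughout.
        ranks[r][:R] = left[-2 * R:-R]
        ranks[r][-R:] = right[R:2 * R]

# Four ranks, eight owned cells each, global cell ids 0..31.
n_local, p = 8, 4
ranks = [[None] * R + list(range(r * n_local, (r + 1) * n_local)) + [None] * R
         for r in range(p)]
exchange_halos(ranks)
```

After the exchange, rank 0's left ghost zone holds the globally last cells 29–31, reflecting the periodic domain.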
Interaction of large- and small-scale dynamos in isotropic turbulent flows from GPU-accelerated simulations
Magnetohydrodynamical (MHD) dynamos emerge in many different astrophysical
situations where turbulence is present, but the interaction between large-scale
(LSD) and small-scale dynamos (SSD) is not fully understood. We performed a
systematic study of turbulent dynamos driven by isotropic forcing in isothermal
MHD with magnetic Prandtl number of unity, focusing on the exponential growth
stage. Both helical and non-helical forcing were employed to separate the
effects of LSD and SSD in a periodic domain. Reynolds numbers (Rm) up to
were examined and multiple resolutions used for convergence
checks. We ran our simulations with the Astaroth code, designed to accelerate
3D stencil computations on graphics processing units (GPUs) and to employ
multiple GPUs with peer-to-peer communication. We observed a speedup of
in single-node performance compared to the widely used multi-CPU
MHD solver Pencil Code. We estimated the growth rates both from the averaged
magnetic fields and their power spectra. At low Rm, LSD growth dominates, but
at high Rm SSD appears to dominate in both helically and non-helically forced
cases. Pure SSD growth rates follow a logarithmic scaling as a function of Rm.
Probability density functions of the magnetic field from the growth stage
exhibit SSD behaviour in helically forced cases even at intermediate Rm. We
estimated mean-field turbulence transport coefficients using closures like the
second-order correlation approximation (SOCA). They yield growth rates similar
to the directly measured ones and provide evidence of quenching. Our
results are consistent with the SSD inhibiting the growth of the LSD at
moderate Rm, while the dynamo growth is enhanced at higher Rm.Comment: 22 pages, 23 figures, 2 tables, Accepted for publication in the
Astrophysical Journa
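Estimating an exponential growth rate from the averaged magnetic field, as described above, amounts to fitting a line to ln(B_rms) versus time over the kinematic stage. The sketch below does this with an ordinary least-squares slope in pure Python; the synthetic time series is an assumption for the demo, not data from the paper.

```python
import math
import random

def growth_rate(times, b_rms):
    """Least-squares slope of ln(b) vs t, i.e. gamma in b ~ exp(gamma * t)."""
    logs = [math.log(b) for b in b_rms]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    num = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den

# Synthetic field growing at gamma = 0.5 with mild multiplicative noise.
random.seed(1)
ts = [0.1 * k for k in range(100)]
bs = [1e-6 * math.exp(0.5 * t) * (1.0 + 0.01 * random.uniform(-1.0, 1.0))
      for t in ts]
gamma = growth_rate(ts, bs)
```

Because the noise is small and multiplicative, it enters the logarithm almost additively, and the fitted slope recovers the true rate to within a fraction of a percent.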
The Pencil Code, a modular MPI code for partial differential equations and particles: multipurpose and multiuser-maintained
openaire: EC/H2020/227952/EU//ASTRODYN | openaire: EC/H2020/818665/EU//UniSDyn
Peer reviewed